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Summary Systems involving many variables are important in population and quantitative genetics, 

for example, in multi-trait prediction of breeding values and in exploration of multi-locus 
associations. We studied departures of the joint distribution of sets of genetic variables 
from independence. New measures of association based on notions of statistical distance 
between distributions tire presented. These are more general than correlations, which are 
pairwise measures, and lack a clear interpretation beyond the bivariate normal distribu- 
tion. Our measures are based on logarithmic (Kullback-Leibler) and on relative 'distances' 
between distributions. Indexes of association are developed and illustrated for quantitative 
genetics settings in which the joint distribution of the variables is either multivariate 
normal or multivariate-£. and we show how the indexes can be used to study linkage 
disequilibrium in a two-locus system with multiple alleles and present applications to 
systems of correlated beta distributions. Two multivariate beta and multivariate beta- 
binomial processes are examined, and new distributions are introduced: the GMS- 
Sarmanov multivariate beta and its beta-binomial counterpart. 

Keywords association indexes, entropy, gene frequencies, Kullback-Leibler distance, 
linkage disequilibrium, multivariate beta distributions, pleiotropy 



Introduction 

The analysis of systems involving several variables is of 
great importance in population and quantitative genetics, 
for example, in multi-trait prediction of breeding values or 
in the study of networks using multi-locus data. The prob- 
lem addressed here is that of measuring departures of the 
joint distribution of sets of genetic variables from stochas- 
tic independence. The target variables could be, for 
instance, pleiotropic additive effects of many loci under 
the infinitesimal model of Fisher (1918), correlated allelic 
frequencies of loci within bins of physical distance defined 
by numbers of base pairs, or sets of allelic frequencies that 
vary at random over clusters of individuals or sub-popula- 
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tions. Likewise, a study may involve systems of alleles 
whose joint distribution departs from independence, due 
either to proximity within a chromosome (linkage) or to 
forces such as random drift or selection favoring epistatic 
combinations, which may also create associations among 
loci on different chromosomes. When the problem is that 
of the study of the statistical association between allele 
states at different loci, this usually is referred to as gametic 
or 'linkage disequilibrium' (LD). with pairwise correlation 
measures typically employed for characterizing such asso- 
ciation. If alleles at two loci are associated due to physical 
linkage, equilibrium (independence) is eventually attained 
asymptotically as generations of random mating accrue. 
LD has received enormous attention in population genetics 
(e.g., Hill 1974: Hedrick 1987; Lewontin 1988; McVean 
2007) and has become an increasingly important topic of 
research in the light of emerging genomic data. e.g. single 
nucleotide polymorphism markers and sequence data 
enable the study of joint evolution of genomic blocks as 
well of as genome-wide association with complex disease 
or quantitative traits. This last area is one in which 
Professor Morris Soller has made important contributions, 
for example, a pioneering paper by Soller & Genizi (1978) 
in marker-assisted selection, and his 1990 work with 
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Weller and Kashi on designs for inferring linkage between 
markers and quantitative trait loci in dairy cattle. The 
present paper is in his honor. 

The objective of this work is to introduce new measures 
of association involving a system of variables based on 
notions of statistical distance between distributions. The 
resulting metrics are more general than correlations, 
which are pairwise measures only and lack a parametric 
interpretation beyond the bivariate normal distribution (e. 
g., Lewontin 1988), especially when associations are non- 
linear or when pairs of variables are neither independent 
nor identically distributed. For example, in the study of 
gametic disequilibrium, the distribution of correlation- 
based estimators is frequency dependent (Hedrick 1987), 
thus complicating comparisons among populations and 
meta-analyses. Also, there are situations such as in a mul- 
tivariate t-distribution with a diagonal covariance matrix in 
which variables are uncorrelated and, yet, stochastically 
dependent. This illustrates that lack of correlation does not 
provide definitive evidence of independence: in this exam- 
ple, variable X, say, informs about Y even though uncorre- 
lated. A related question is the inability of correlation to 
measure association between, say, 10 loci or 20 pleio tropic 
effects. The measures of association developed in this study 
are based on departure between the distributions of the 
variables under independence and association assump- 
tions. Indexes measuring association between random 
variables, irrespective of their number or of the form of the 
joint distribution, are suggested and illustrated. 

The article is organized as follows. Following this intro- 
duction, measures based on logarithmic and on relative 
'distances' between distributions, as well as indexes of 
association, are presented. Subsequently, these indexes are 
developed and illustrated for quantitative genetics settings 
in which the joint distribution of the variables under con- 
sideration is either multivariate normal or multivariate-£. 
The fourth section shows how these indexes can be used to 
study LD in a two-locus system with multiple alleles; here, 
a correlation would not have much meaning, if at all. The 
fifth section presents applications to systems of correlated 
beta distributions; all these are generalizations of the uni- 
variate beta distribution, which has been used extensively 
in population genetics to study evolution of gene frequen- 
cies. In particular, two multivariate beta and multivariate 
beta-binomial processes are discussed, and new distribu- 
tions are introduced: the GMS-Sarmanov multivariate beta 
and its beta-binomial counterpart. The paper concludes 
with a discussion of the concepts and of other procedures 
that have been proposed for study of multi-locus LD. Math- 
ematical details are relegated to an appendix. 

Measuring association among variables 

To motivate the approach used here, consider a pair X, Y 
of random variables with realized values x, y; the ideas 



generalize readily to distributions of higher dimension sim- 
ply by replacing scalars by vectors. Let p(x,y) and p(x) x p 
(i/) be the densities of the joint distributions under associa- 
tion and independence, respectively; p(x) and p(y) are the 
marginal densities of X and Y respectively. Association 
occurs whenever p(y \ x) (or p(x \ y)) differs from p(y) (or p 
(x)). Measures of dependence (association) can be derived 
from the density ratio 

a p(x,t/) = p(y\x) = p(x\y) 
x " p{x)p(y) p(y) p(x) 

If a xy = 1 for all pairs of values, the 'distance' between 
distributions is 0 and independence holds. Values of a. xy 
larger or smaller than 1 suggest statistical dependence. For 
example, a xy larger than 1 indicates 'coupling', i.e. the density 
of certain pairs of values is higher than expected under inde- 
pendence; values smaller than 1 would be indicative of 
'repulsion'. Likewise, strong departures of log a xy from 0 
would be indicative of an association between x and y. 

We discuss two measures of association based on 
either expected logarithmic distance between distributions 
or on expected relative distance, as conveyed by the 
average values of log a xy and of a xy respectively. These 
measures are used subsequently to develop indexes of 
association, irrespective of the number of variables in a 
system. 

Association based on Kullback-Leibler distance 

The Kullback-Leibler (KL) logarithmic distance (Kullback 
1968) is defined as the sum of two KL discrepancies 
between the distributions: one under association and the 
other under independence. The first discrepancy measures 
departure from association when the latter is true, and 
the other measures departure from independence when 
this attribute holds. The two discrepancies are positive by 
construction and are defined as 

T>m = Discrepancy (association — » independence) 
= ~E P (x,y)(loga xy ) = J J (logaxy)p(x,y)dxdy, 

which measures discrepancy when it is true that X and Y 
are associated, and 

T>ia = Discrepancy (independence — ► association) 

= e p{x)m( 1o £ cc w) =// ^og^p{x)p{y)dxdy, 

measuring discrepancy when independence holds; the 
range of integration depends on the random processes 
involved. If the two distributions are identical, both dis- 
crepancies are null; on the other hand, large discrepancies 
reflect stochastic dependence. Above, E p (^ y ) and 
denote expectations taken under the assumptions that (X, 
Y) follow either a bivariate distribution (association) or 
that these variables are independently distributed. 
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A connection with the likelihood ratio statistic is fairly 
direct. Suppose that the parameters of the distributions 
under independence and association are estimated by maxi- 
mum likelihood (ML), and that these estimates are regarded 
as if they were 'true' values. Then, log a xy , with the densi- 
ties evaluated at the corresponding ML estimates would be 
the log of a ratio between maximized likelihoods. In D AI one 
would be taking expectations of the log-likelihood ratio over 
all possible realizations expected if the 'full model' (associa- 
tion) were true; on the other hand, in D [A the expectations 
of the reciprocal of the log-likelihood would be taken under 
the assumption of independence. In this context, D A1 and 
D M are averages of log-likelihood ratios, as opposed to an 
evaluation at a single realization (in ML estimation parame- 
ters vary whereas observations are taken as fixed). Note 
that D AL and D [A do not involve any requirement of 'nest- 
ing' of models, contrary to likelihood ratio tests under regu- 
larity conditions, but this is a different issue. 

The discrepancies are such that D AI ^ D IA , and a unique, 
symmetric KL distance is defined (Kullback 1968) as 



KL = D AI + D U - 



(2) 



Then, an index of association (taking values between 0 
and 1) is obtained by normalizing D IA as 



Dr, 



Dm + D L 



(3) 



with 



1 - i 



D Al 



D A1 + D IA ' 



provided that KL is not null. A large value of D M relative 
to D A i (or the opposite) is indicative of situations in which 
there is more association than what would be expected 
under independence so that 9 or 1 — 9 would be close to 
1. Thus, values of 9 or of 1 — 6 away from j provide evi- 
dence of discrepancy between the hypotheses of random 
ensembles of (x,y) pairs vs. associated ensembles. 

There is no closed form for D IA or D A1 for many distribu- 
tions, in which case these metrics must be calculated 
numerically or estimated using Monte Carlo methods, given 
some parameter values. If association is weak, the two loga- 
rithmic discrepancies and KL are nearly null, and this can 
lead to very unstable, even absurd, Monte Carlo estimates 
of 9. Also, the KL distance and the two discrepancies are 
logarithmic, and this perhaps provides a less intuitive 
notion of 'distance' than one based directly on a xy . An alter- 
native 'distance' measure is presented immediately below. 



Dk = / / ^P(x)p(y)dxdy = j J ^^ dxdy (5) 



and 



D' 



D'm + D' u 



(6) 



The metric D' AI gives the expected value of a xy under 
association and DJ A gives the expectation of a~ y under the 
assumption of independence. These expectations are both 
positive, by construction, and values of 9 away from \ 
can be construed as reflecting departure from indepen- 
dence. For example, 9' = 5 or | would suggest association. 
The connection with likelihood ratios surfaces again; for 
example, Dj 4 is the average likelihood-ratio over all possi- 
ble realizations of the data under the independence 
assumption. 

In what follows, the proposed indexes are examined for 
some distributions that are used often in quantitative and 
population genetics research. 

Multivariate Gaussian and t-distributed 
variables 

In quantitative genetics, the analysis often focuses on a 
set of continuous correlated variables (X 1 ,X 2 ,.. -,X K ) that 
follow some multivariate distribution; these variates could 
be, for example, additive genetic effects for K traits. Sup- 
pose that a matrix of estimates of additive genetic correla- 
tions is available (and taken as a 'true' matrix), and that 
we seek to arrive at a single measure of association 
among the K genetic values, as opposed to the K 3 ^ pair- 
wise measures provided by the matrix of genetic correla- 
tions. Below, examples of multivariate normal 
distributions with K= 2, 3, 4 and of a bivariate t-distribu- 
tion are considered to illustrate how 9 behaves. 



Bivariate normal distribution 

Let Xi and X 2 be two standardized normally distributed 
genetic values with correlation p. For this distribution 

P 2 (l 



p 2 T 1 + 



D A1 = - log \/l - p 2 and D !A 

log \J\ — p 1 (e.g. Sorensen & Gianola 2002). When the 
correlation is null, the two discrepancies are 0; when the 
correlation is perfect, the discrepancies are both 00. Thus, 
KL = p 2 (l — p 2 ) -1 , which is 0 when the two random vari- 
ables are independent and co under a perfect correlation. 
Then 



Association based on relative distance 

Define the relative discrepancies 



D' Al = J J V(x,y)dxdy = J J ^^^dy, (4) 



Di, 



D Al + D lA 



1 + 



(1 



p 2 )logVl-P 2 



(7) 



measures discrepancy away from the independence situation 
when the latter holds, relative to the KL distance between 



the two distributions. It can be shown that lim p 2^ o (0) 
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and lim p 2^ 1 (0) = 1, providing the lower and upper 
bounds of 8 respectively. 

The values of 8 and of 1 — 8 are plotted against p in 
Fig. 1. The dotted line ('holds water') gives the relative 
discrepancy (8) away from the independence models 
when this is true, and the dashed line ('spills water') 
depicts the trajectory of 1 — 8, i.e. the relative discrep- 
ancy under association; note that a given curve is a 
mirror image of the other. As the absolute value of the 
correlation increases 6 —> 1, as one would expect. Since 
an association index with values ranging between 0 
and 1 may be easier to interpret, an alternative mea- 
sure could be 



y: 



2DiA 

Dm + D M 



1 = 28- 1. 



However, 7 takes values in [0,1] provided that 8> I, 
which holds for a bivariate Gaussian distribution, but not 
so in general. The relationship between 7 and p for this 
Gaussian model is also shown in Fig. 1 (solid line), and 
the association is suggested as weaker when measured by 
8 than when assessed with the correlation p. For example, 
if \p\ = 5, 6 is smaller than 0.15. In multivariate systems 
or in other distributions, 8 can be much smaller than \ so 
that 7 takes negative values, as illustrated below, so it 
turns out that 8 often provides an easier to interpret mea- 
sure of association than does 7. 



Multivariate normal distribution 

The metrics under consideration can be applied to distri- 
butions with any number of dimensions, leading to mea- 
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Figure 1 Measures of association of two bivariate Gaussian variables 
as a function of their correlation (p). The straight lines give the 
strength of the association as measured by the absolute value of p. 
The dotted ('holds water': 8) and dashed ('spills water': 1 - 9) lines 
depict the relative contributions to the Kullback-Leibler distance due 
to discrepancies under independence and dependence models, 
respectively. Values of the association measure y=26 - 1 are repre- 
sented by the dark solid line. 



sures of departure from independence in more general 
situations, such as when alleles in a multiple-locus system 
are studied. To illustrate, consider the K-dimensional mul- 
tivariate Gaussian distribution (x~N(0,R), where variates 
are in standard deviation units so that R is a correlation 
matrix; under independence (x~N(0,I). It can be shown 
(Penny & Roberts 2002) that 



KL = -tr(R~ 1 
2 1 



■ R - K, 



(8) 



Where £r(.) denotes the sum of the diagonal elements of a 
squared matrix such as R. Further, 



and 



so that 



D AJ =i[tr(R)-log|R|-K], 



DM = -[log|R|+tr(R- 1 ) -K] 



^^(R- 1 +R) — K 



(9) 



(10) 



fll) 



Examples of several multivariate normal distributions are 
given next. 

Equi-correlated trivariate normal distribution 

Let three variables be equi-correlated with correlation p, 
so that 



R 



with 



|R| = 2p 3 - 3p 2 



trCR) 



and 



^-(R- 1 ) = 3(p + 1)/(1 + p - 2p 2 ). Using (8)-(ll) yields 



D A; = -log^(p-l) / (2p+l) 
3P 2 



Dm =■ 



KL: 



1 + p - 2p- 
3P 2 



— Dai 



l + p-2p 2 



and 



] (p-l)(l + 2p)log v /(p-l) 2 (l + 2p) 
3p 2 



fl2) 



With 0<8<l. For example, for p = 0.25, 0.50 and 
0.75, the index of association 8 takes values 0.49. 0.54 
and 0.66 respectively. Figure 2 depicts 8, 1 — 8 and 8 as a 
function of the coefficient of correlation. At p = 0, 8 = 0.5, 
but then it decreases slightly as the correlation increases 
such that 7 does not attain a positive value until 7 
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Correlation Correlation 



Figure 2 Indexes of association as a function of the coefficient of 
correlation in an equi-correlated trivariate normal distribution. The 
thick solid line gives the relative Kullback-Leibler discrepancy between 
distributions when independence holds (0), and the dashed line gives 
the relative discrepancy when association is true (1 - 9). The thin line 
gives the trajectory of y=26 - 1 . 



Figure 3 Discrepancies from independence to association (9, solid line), 
and from association to independence (1 - 9, dashed line) as a fraction 
of the Kullback-Leibler distance between two tetravariate normal distri- 
butions. The straight lines give the absolute values of the correlation p. 



Let now K = 4 and take as correlation matrix 



reaches a value of about 0.3. Note that 8 is not symmetric 
with respect to p (the same is true of |R|); for instance, 
when p = — \, 8 = 0.59 so that association would be 
viewed as stronger than when p = r, since 8 = 0.49 then. 
This is a consequence of the values that the discrepancies 
D IA and D AI take at varying values of p. 

Tetra-variate normal distributions 

Alternative measures can be derived from eigenvalues of a 
correlation matrix; however, this is sensible only for 
distributions in which linear relationships between vari- 
ables are meaningful. The index 8, derived from notions of 
statistical distance, does not have this limitation because 
KL discrepancies have a general meaning and can be cal- 
culated for any distribution, at least numerically, given 
values of the parameters. Naturally, any single measure of 
association in a multivariate distribution becomes less 
transparent as the system of variables increases in dimen- 
sion. 

Let K = 4 and suppose the variables are equi-correlated 
with coefficient p. Then, |R| = -(p - l) 3 (3p + 1), 
trfR- 1 ) = -(8p + 4)/[(p - l)(3p + 1)] and 



l-(p-l)(l + 3p)- 



io g y- (p- + 3p) 

6p 2 



(13) 



Figure 3 displays how 8 and 1—9 vary against p. At 
8 = 0 one has 0=1 — 0 = \; 0 decreases somewhat subse- 
quently and then increases, attaining 0.5 again at about 
p = 0.45. The values of the index of association suggest 
that the independence and association models are not 'too 
distant' unless the correlation is below —0.10, if negative, 
or above p = 0.50, if positive. 



R 



1 0.9r 0.5t 0.1t 

0.9t 1 0.2t 0.1t 

0.5t 0.2t 1 0.1t 

0.1t 0.1t 0.1t 1 



where 0 < r < 1, so that when t = 0 the four variables 
are independent. For t=l, log |R| = —2.5459 and tr 
(R -1 ) =24.8980, yielding 8 = 0.8782. Likewise, for 
t = 1, log |R| = -0.2960 and tr(R _1 ) = 4.6540, so that 
8 = 0.5473. This illustrates that the strength of the asso- 
ciation is proportional to the strength of inter-correlation, 
and that 8 provides a metric for measuring departure from 
independence. 

We finish this section by posing the following question: 
what does 8 say that is not informed by all possible pair- 
wise correlations? Again, let K = 4 and take as correlation 
matrices 



Ri 



;R 2 



1 


0 


0.6 


0 


0 


1 


0 


0.6 


0.6 


0 


1 


0.6 


0 


0.6 


0.6 


1 



The salient point is that the correlation structure is 
such that while pairs of variables (1.3), (2.4) and (3,4) 
are correlated, (1,2) are independent. 

What is the overall strength of the association in the 
system? Calculations yield 6*i = 0.54 and 8 2 = 0.91 for the 
networks of variables characterized by p = } and p = yjy , 
respectively. While the correlations measure association 
between pairs of nodes, 8 gives an indication of the overall 
degree of association. 



Bivariate f-distribution 

As mentioned, there are situations in which random vari- 
ables are jointly dependent, but yet, their correlation is 0, 
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which would incorrectly suggest that X does not inform 
about Y. This highlights limitations of the correlation 
parameter as a metric for statistical dependence between 
sets of variables. One such case is the multivariate- £ distri- 
bution, in which variables can be statistically dependent, 
yet uncorrelated. The univariate and multivariate t-distri- 
butions appear in robust linear regression models for quan- 
titative genetic analysis based on pedigrees (e.g., Stranden 
& Gianola 1998; Sorensen & Gianola 2002; Rosa etal. 
2003). The univariate-t distribution has also been used 
as a prior in Bayesian linear regression models for 
genome-enabled prediction of single traits (Meuwissen 
etal. 2001; Gianola etal 2009). 

Suppose two variables, e.g. SNP effects on a pair of 
traits, follow a bivariate t-distribution with a null mean 
vector, v degrees of freedom and a scale matrix that is 
equal to a 2 x 2 identity matrix; here, the two effects are 
uncorrelated but not independent, as shown in the Appen- 
dix. For this situation, the density ratio (1) is 



r(^)r(f)[i + ; 



■[( 



i+- 



so that the relative discrepancies in (4) and (5) become 



Al 



- , p(xi,x 2 \t>) < 



[l + ' 



and 



iy, A 



r 2 m 

re?)r(t) 



-mi 



(i+^) 



-m 



The expectations above do not have closed forms but 
can be evaluated using Monte Carlo methods: given the 
value of the degrees of freedom parameter, random num- 
bers can be drawn from a bivariate t-distribution and from 
two independent univariate t-distributions, to approximate 
D' AI and D' M . 



The index of association 9' in (6) was computed for 
t-distributions with ^ = 1,2, 4, 6,8, 20 and 100 000. The 
last value produces an approximation of the joint distribu- 
tion of two independent normal variates, whereas the 
setting with /i = 1 evaluates distance between a 'bivariate 
Cauchy' distribution and the distribution of two identically 
and independently distributed Cauchy random variables. 
The t-processes with \i = 1, 2, 4 do not have a finite 
variance, but the distributions are proper (i.e., the density 
integrates to 1). As shown in Table 1, with sets of 1 million 
random numbers used for evaluating D' AI , D' IA and 8', the 
'distance' D' A1 from association to independence decreases 
as v increases, approaching 1, whereas the distance from 
independence to association D' IA increases, approaching 1 
as well. With ;y= 100 000 the random variables are 
essentially independent since Q 1 = 1 — & = \ , leading to 7' 
= 0, the setting v = 1 produces the strongest possible 
association, with 8' = 0 and both 1 — 6' and 8' equal to 1. 

Linkage disequilibrium 

Suppose there are C and R alleles at locus A and locus B, 
respectively, such that data on gametic types can be 
arranged into an Rx C contingency table. Let py be the 
probability of observing a haplotype with configuration 
'if. If n gametes are screened and the observed number 
having such configuration is «y, the assumption of multi- 
nomial sampling leads to 



p(x)p(y) 



R C 

TT TT p"" 

nn,n 2 i,...,n R cJ ji^y l > 



n 



"11, "21, 



„ , n n (pa*.)* 

, "RC/ i=l;=l 



R C 

nn^ 

_ i=ij=i 

n n (pmr> 

i=l;=l 

where Pi. = Yli=iPih an d Pi = ^2i=iPii are the marginal 
probabilities of alleles '{ and ';"at loci A and B respectively. 
Define parameters D tl =py — PfPj., measuring departure 
from independence (i.e. disequilibrium between alleles i 
and ;). In a two-locus model a single D parameter is 



Table 1 Departures of zero-mean bivariate t-distributions with a 2 x 2 identity matrix as scale parameter (so their correlation is null) from 
independence as measured by D' AI , D', A and by the indexes of association 6' = D' !A /(D' A i + D', A ), 1-0' and 7' = 2(1 - 8 1 ) - 1 (when 8' = \ the 
random variables are independent) 



Degrees of freedom 


D' AI 


D' IA 


D' = D' AI + D' IA 


6' 


1 - 8' 


y 


1 


7.28 x 10 9 


0.4376 


7.28 x 10 9 


0.0000 


1.0000 


1 .0000 


2 


63438837 


0.6424 


6348838 


0.0000 


1 .0000 


1 .0000 


4 


20.7443 


0.7947 


21.5390 


0.0369 


0.9631 


0.9262 


6 


2.6299 


0.8560 


3.4859 


0.2456 


0.7544 


0.5088 


8 


1.2715 


0.8889 


2.1604 


0.4115 


0.5885 


0.1770 


20 


1.0584 


0.9528 


2.0112 


0.4738 


0.5262 


0.0524 


100 000 


1.0000 


1 .0000 


2.0000 


0.5000 


0.5000 


0.0000 
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needed, and by using p ij = D + p L pj, then a xy would be the 
ratio between two likelihoods: the 'unrestricted' likelihood 
(indexed by parameters D, p L , Pj) and the 'restricted' likeli- 
hood (indexed by the two allelic frequencies). The asymp- 
totic connection between the standard log-likelihood ratio 
statistic and the \ 2 metrics used for multi-locus analysis 
of LD by Hill (1975) and more recently by Zhao etal. 
(2005) is given, for example, in Agresti (2002). 

Under this sampling scheme, the discrepancies become 



Dm 



;=i j=i 



and 



i=l ;=1 \ Pi i 

Hence 



i=i j=i V Pv 



R C 

EEft-P;. log 



Dm _ i=i j=i 
D IA + D AI ~ » c 



Naturally, 6 is not defined when all Dy = 0 because then 
the two distributions would be identical and the KL distance 
is 0 in that case. Note that D A[ is the expected value of the 
log-likelihood ratio, but under the assumption of associa- 
tion; on the other hand, when the sampling distribution of 
the log-likelihood statistic is considered, the reference distri- 
bution is the null model, i.e., linkage equilibrium. This illus- 
trates an important conceptual difference, apart from the 
fact that D [A and D^, involve integration over the sampling 
space of allelic outcomes at the two loci in question. 

To illustrate, consider a hypothetical example, patterned 
after data in Spiess (1989) reproduced partially by 
Frankham etal. (2010). The data pertain to three and 
four alleles at the major histocompatibility loci HLA-A and 
HLA-B respectively; the data, after a modification 
explained subsequently, are shown in Table 2. Frankham 
etal. (2010) presented only the data for these alleles, yet 
there are many more variants at these loci. Relative fre- 
quencies in Frankham et al. (2010) were 'normalized' 
such that the 12 joint frequencies would add up to one. 

Taking haplotype frequencies p r) as true values, the dis- 
crepancy of the independence model away from associa- 
tion is 



R c / 

iog(- 

i=i i=i VP' 



108 



0.02 70 log 



0.0270 



0.2964 x 0.3158 



0.0950 
+0.0950 log , , 



+ 0.0249 log 



0.2964 x 0.3841 
0.0249 



0.2412 x 0.3001 
108 x 0.2928 = 31.6235 



The relative contributions of each of the alleles at the B 
locus to the 0.2928 in D A1 above are (after rounding) 
22.71%, 61.77%, 13.76% and 1.76% for B7, B8, B35 and 
B44 respectively, implying that associations with allele B8 
are responsible for most of the departure of the independence 
model away from association, assuming this last one is 'true'; 
note that the n tj do not enter into the calculations, since the 
frequencies are viewed as true ones. For example, the relative 
contribution of B8 to D Al is arrived at as follows: 



108 



0.2456 log 



0.2456 



+ 0.0402 log 
+ 0.0070 log 



0.2928 x 0.3158 
0.0402 



0.2928 x 0.3841 

0.0070 
0.2928 x 0.3001 



= 108 x 0.1809, 

and then 108 x 0.1809/(108 x 0.2928) = 0.6178. 
Likewise, 

D 1A = -nJ2J2 Pi P' [og {fp-) = - WS x (-°- 4592 ) 
= 49.5987, 

and the relative contributions of the four B-locus alleles to 
D lA are 16.80%, 53.93%, 12.11% and 17.16% for B7, 
B8, B35 and B44 respectively, with B8 playing a major 
role again. Similar calculations can be done for the A 
locus. The KL distance is 31.6235 + 49.5987 = 81.2222, 
and the KL distance per haplotype scored (IV =108) is 
81.2222/108 = 0.7521. The indexes of association are 
9 = ff-ffff fa 0.61 and 1 - 6 fa 0.39. 



Table 2 Hypothetical data for major histo- 
compatibility loci HLA-A and HLA-B, with 
three and four alleles respectively 3 The p,y 
are haplotype frequencies; p L = Y^p-,\ anc ' 

4 /=1 
p.i = YLPi\ S' ve the- allelic frequencies. The 

;=1 

hypothetical number of observed haplotypes 
is assumed to be 108. 
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A3 
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B7 


Pn 


= 0.0270 


P12 = 


0.0950 


P13 = 


0.1743 


pi. = 0.2964 


B8 


P21 


= 0.2456 


P22 = 


0.0402 


P23 = 


0.0070 


p 2 . = 0.2928 


B35 


P31 


= 0.0106 


P32 = 


0.0651 


P33 = 


0.0939 


p 3 . = 0.1696 


B44 


P41 


= 0.0325 


P42 = 


0.1838 


P43 = 


0.0249 


p 4 . = 0.2412 


Marginal 


P.1 


= 0.3158 


P.2 = 


0.3841 


P.3 = 


0.3001 


1.0 (n = 108) 
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In an Rx C table of alleles, a correlation has little mean- 
ing and is not invariant with respect to how alleles are 
scored (e.g. 0, 1, 2) at each of the intervening loci. On the 
other hand, the chi-square statistics of Hill (1975) and 
Zhao et cd. (2005) are also invariant and have a simple 
interpretation in terms of deviations expected under the 
null distribution. Our parameter 6 is also score invariant, 
is independent with respect to the order of rows and col- 
umns, and applies to any distribution. For instance, if the 
simple multinomial sampling model is violated due to, e.g. 
over-dispersion, an appropriate 6 could be developed 
under such model, whereas a 'normalized' \ 2 would pro- 
vide a crude, albeit probably useful, approximation. 

The 'relative distance' metric takes the form 



p(x)p(y) 



with 



R C 

: EE 
i=i j=i 



R C / \ n,i , \ R C 

nnfe) (,„„„;: Jnrw 



and 



^=M<)=ee nnrf 

i= i j= i l i= i j= i V P'l 
n \ R C 

nxi,n 2 i,...,n RC J tijLi 



We evaluated D' M , D' M and ff = D' M /(D' M + D'^j) using 
Monte Carlo sampling. In the calculation of D' AI one mil- 
lion random trials of size 108 each, with {n^} varying at 
random, were sampled from a multinomial distribution 

R C / \ mj 

with probabilities Ptf, then, ]] Q IzrjH was evaluated 

i=i j=i ^ ' ' 

for each realization and averaged over trials. Similarly, 
one million trials of size 108 were sampled from a multi- 
nomial distribution with probabilities p L p h and the realiza- 

R C 

tions of Tl FF f^ 1 ) were averaged out. The index 8' 



nn(«) 



was estimated using the mean and median of the draws of 
0', yielding 0.9953 and 0.9998. The association (disequi- 
librium) between alleles at the A and B loci is patent, both 
when logarithmic (9) and relative (#') distances are 
employed to measure it. 

Systems of multivariate beta processes 

Bivariate beta distributions 



SNPs are correlated (e.g. Akey et cd. 2002; Weir et d. 2005; 
Akey 2009). This is typically due to stochastic dependence 
between alleles at linked loci (producing LD), but evolution- 
ary processes causing dependence among the true frequen- 
cies themselves have been suggested; see Ohta (1982) for a 
population genetics model based on this concept. For exam- 
ple, a correlation between alleles at randomly drawn pairs 
of loci arises if alleles are conditionally (given the allelic fre- 
quencies) independent but with allelic frequencies varying 
at random according to a beta distribution. 

Suppose that pairs of loci are sampled at random from 
some conceptual population of equi-correlated loci and 
that all pairs of 'true' frequencies define a probability dis- 
tribution. Wright (1937) found that a beta distribution 
arose from a diffusion equation used to study changes in 
allele frequencies in finite populations, so the beta process 
is well grounded in population genetics. Beta distributions 
have been used in studies of population differentiation as 
mixing processes to create randomness of allelic frequen- 
cies, leading to beta-binomial likelihoods or to unrecog- 
nized posterior distributions that must be resolved using 
sampling procedures, e.g., Holsinger (1999), Balding 
(2003), Beaumont & Balding (2004) and Gianola etal. 
(2010). 

Similarly, it would seem sensible to assume that allelic 
frequencies of pairs of loci within the hypothetical popula- 
tion have an association that can be modelled with bivari- 
ate beta distributions to represent random variation of 
true, albeit unknown, allelic frequencies. For example, 
suppose that there are S half-sib families that can be 
viewed as drawn as random from some population and 
that we are interested in modeling association between 
pairs of allelic frequencies stemming from such sampling 
scheme. Here, distributions proposed by Sarmanov (1966) 
and Olkin & Liu (2003) are discussed and generalized as 
candidate processes that could be used to model this type 
of association. 

Olkin-Liu bivariate beta distribution 

First, we review how a specific bivariate beta distribution 
arises. As in Olkin & Liu (2003), let U, V and W be ran- 
dom variables following independent standard Gamma dis- 
tributions with parameters a, b and c respectively. By 
construction, X = and Y = y^— possess the beta dis- 
tributions X: Beta(a.c) and Y: Beta(b,c) respectively. 
Clearly, X and Y are positively correlated (through W), 
with a strength that depends on the values of a, b, c. 
Olkin & Liu (2003) show that X and Y have a bivariate 
beta distribution with density function 



A common finding in studies of population differentiation 
using Wright's F-statistics (Wright 1931; Cockerham 
1969) with single nucleotide polymorphisms (SNPs) is that 
estimates of allelic frequencies within a bin of adjacent 



r(a + b - 



c)x a - 1 y h - 1 (l 



x) i,+c ~ 1 (l-y)" +c ~ 1 



r(«)r(&)r(c)(i - *b) 



a+h+c 



(14) 
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Where 0<X<1 and 0< Y< 1. Moments E{X k Y l ) cannot 
be written in closed form but can be approximated 
numerically or using sampling methods. Large c and small 
a, ft produce correlations close to 0, whereas large a. ft or 
small c produce correlations close to 1 (Olkin & Liu 
2003). For example, a correlation equal to 0.002 is 
obtained for n = ft = 0.01 and c=5, whereas the correla- 
tion is 0.91 for a = 2.5, ft = 4 and c = 0.1. Figure 4 
displays scatter plots of 5000 samples obtained from each 
of four bivariate beta distributions. Plot 1 represents a dis- 
tribution in which the correlation is very low, and yet, 
there is considerable association between pairs of values 
near 0, illustrating inadequacy of correlation to reveal 
association. In plot 2 (resembling a 'meteorite') clustering 
takes place primarily tit large values of X and Y. The two 
bottom plots depict bivariate beta distributions with simi- 
lar correlations but with a completely different pattern of 
association. Clearly, correlation often fails as measure of 
statistical association. 

In this bivariate beta model the density ratio takes the 
form 



p(x,y) ^ T(a + b + c)T(c)(l-x) h (l-y) a 
x " p(x)p(y) T(a + c)T(b + c) (1 - xy) a+b+c ' 1 ' 

Where p(x) and p{y) are the beta densities of X and Y: 
Beta(b,c). Then 



1 . Bivariate beta sample 
(a=0.01 b=0.01 c=5) 
Cor=-0.00 




— r~ 

0.1 



i 1 — 

0.2 0.3 
X 



3. Bivariate beta samples 
(a=10b=10c=10) 
Cor=0.50 




~i 1 1 r 

0.2 0.4 0.6 0.8 
X 



2. Bivariate beta samples 
(a=2.5 b=4c=0.1) 
Cor=0.80 




0.2 0.4 0.6 0.8 1.0 



4. Bivariate beta samples 
(a=0.5 b=0.5 c=0.5) 
Cor=0.46 




l 1 1 1 1 r 

0.0 0.2 0.4 0.6 0.8 1.0 
X 



Figure 4 Scatter diagrams of 5000 samples from each of four Olkin- 
Liu bivariate beta distributions. Plot 1 illustrates a strong association 
with essentially no correlation. Plot 2 ('meteorite') depicts a limitation 
of the correlation as a parameter for describing association. Plot 3 
suggests association clearly. Plot 4 shows a bivariate distribution that 
is not trivial: the true correlation (0.46) arises primarily due to weaker 
association in the 'middle' of the bivariate sampling space. 



D /U = log 



T(a + b + c)T(c) 



r(a + c)T(b + c) 

(1 -*)'(!-»)' 



log- 



log 



1 - xy ) 
r<a+b + c)T(c 



a+h+c 



r(a+c)T(b + c) 

{l-x) b (l- y y 



log- 



(1 - xy) 



a+h+c 



so that 



KL : 



{fclog(l-*) + fllog(l-y) 



c p(iea) 

(a+b + c) log(l-jcj/)} 



J i>Ml>(!/) 



{6Iog(l-Jc) + alog(l-«)- 



(a+6+c)log(l -xy)}. 

(16) 



The corresponding D'^j, D' M and KL' can be calculated 
by taking expectations of density ratios as opposed to 
expectations of their logarithms. 

Indexes of association based on D' Ah D' IA and KL' were 
calculated for the following combinations of parameters, 
where p denotes the expected correlation: 1) a — 0.01, 
ft = 0.01. c=5 (p = 0.002); 2) a = l, ft = 0.5, c = 0.6 
{p = 0.251), 3) a=l, 6 = 1, c = 0.9 (p = 0.496) and 4) 
a = 2, ft =3, c = 0.4 (p = 0.750). Expectations under the 
independence model were approximated by drawing 1 mil- 



lion random numbers from independent X ~ Beta(a,c) and 
Y ~ Beta(b,c) distributions, and then averaging the evalua- 
tions of the appropriate expressions over the draws. In the 
association model, 1 million random triplets (U,V,W) were 
drawn from independent standard gamma variables with 
parameters a, ft and c. to then form bivariate realizations 



and y 



Table 3 gives the results, and all 



quantities (save for the true p%y) are Monte Carlo esti- 
mates. The indexes of association 9' and 1 — 9' departed 
from 0.5 and approached 0 and 1, respectively, as the 
correlation in the bivariate beta distribution became stron- 
ger. Likewise, 7' and 7" drifted away from 0, approaching 
1 1 1 . In settings (3) and (4), 9' and 1 — 9' suggest a stron- 
ger association than what would be indicated by p. 

Olkin-Liu bivariate beta-binomial distribution 

In studies of LD, focus is typically on correlation between 
alleles rather than between frequencies. We discuss how a 
correlation between alleles can arise when the association 
stems from frequencies, as could happen when the pop- 
ulation consists of clusters of individuals resulting from 
some family structure. Consider bi-allelic locus i with alle- 
lic frequencies Pr(A ; ) = p ( and Pr(n f ) = 1 — pf. A] and n ; 
denote the two alleles at locus /. Suppose that a sample of 
N individuals is scored, and that the observed number of 
copies of A; and «j are ra, and n an respectively, with 
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Table 3 Departures of Olkin-Liu bivariate beta distributions from 
independence as measured by D' AI , D' IA and by the indexes of associ- 
ation 9' and 1 - 6' (when 8' = 1 independence holds), for four combi- 
nations of parameters a, b and c. p xy is the Monte Carlo estimate (1 
million samples) of the true correlation p XY . Means of variables X and 
V are estimates from samples from association (A) or independence 
(/)models. The D' A/ and D' /A parameters are estimates under the 
appropriate asumptions; KL' = D' AI + D' IA ; 9'= D', A /(D' A i + D', A ); j' 
= 28'- 1; 7 "=1 - 2q\ 







a = 0.01 




a = 0.1 
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ave. 
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0.8825(A) 
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D' AI 




0.9999 
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0.0000 
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-0.9995 
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0.0000 




0.1037 


0.6926 


0.9995 



would be in LD, marginally, even though being in condi- 
tional (given the allelic frequencies) equilibrium. If the 
unobserved, 'true', allelic frequencies follow some bivariate 
distribution with density g(pi,p 2 \Q), where 0 denotes its 
parameters, the density of the marginal distribution of the 
observed number of allelic counts under this association 
model would be 



Pa (DATA) = 



n 

1 1 



2N 
riA, 



1 1 

/ /^(i- Pl r^(i-P2) 

0 0 



0(Pi,P2\O)dpidp 2 . 

(19) 

If a bivariate Olkin-Liu beta distribution with parameters 
0 = (fl,ft,c) is assumed, the joint density of allelic frequen- 
cies p! and p 2 as in (14). Letting 

n u \ - rfo + ft + c) 
HWJ - r(fl)r(ft)r(c) , 

the marginal distribution (19) takes the form 



PA (DATA) 



n( 2N 



C(a, b, c) 



tia, + n a , = 2N. Given p;, the distribution of n Al is Binomial 
(2N,p;). Now, let the allelic frequency p; vary at random 
among clusters (e.g. sub-populations) according to a beta 
distribution with parameters (a,c). Then, the marginal dis- 
tribution of the observed number of alleles is beta-bino- 
mial (e.g., Casella & George 1992; Sorensen & Gianola 
2002), that is, the probability of observing n Al copies 
among the 2N alleles scored is 



Pr(n A ,|2N,a,c) 



2N\ r(a + c) T(n Al + a)r(n fl , + c) 
n Al )r(a)r(c) T{2N + a + c) 

(17) 



Consider now a pair of loci, and let the observed num- 
ber of copies of the four alleles be represented as 
DATA = (n Al , n ai , n Al , n ai ); haplotype frequencies are not 
relevant in the model that follows. If, given the allelic fre- 
quencies p = {p;}, the number of copies of Ax and A 2 are 
independently distributed, i.e., the two loci are in linkage 
equilibrium, the distribution of the observed number of 
alleles is 



I(DATA|p) = | ! 



2N\ n A 



■Pi)""- 



(18) 



Suppose now that this pair is a realization from a sto- 
chastic process where loci are sampled over a conceptual 
population of pairs, e.g. the two loci are drawn at random 
from a physical 'bin' of a certain length in kilobases. This 
process generates a covariance structure between allelic 
frequencies of pairs of loci within the 'bin'. The alleles 



// 



'i 1 P 2 (1 



Pi 



pi) n ai+ l, +C -l {1 



P2 



n„2+H+c-l 



0 0 

dp 1 dp 2 . 



(1 ~P\P2 



.a+b+c 



(20) 



This integral cannot be written in closed form, but it 
can be evaluated using Monte Carlo methods such as 
importance sampling (e.g. Sorensen & Gianola 2002). One 
possible scheme is outlined in the Appendix. 
Under this distribution 

Pa (DATA) 



a, 



pj(DATA) 



Where p/(DATA) is the distribution under independence 
between observed allele numbers, and given by the prod- 
uct of two beta-binomial distributions, that is 

Pi (DATA) - r(fl)r(c) r(2N + « + c ) " 



2N\ r(b + c) T{n Al + h)T{n ai + c) 
n A jT{b)T{c) r(2N+b + c) 



(21) 



Then 



r(n + b + c)r(c) T(2N + a + c)r(2N + b + c) 



r(a + c)r(b + c) r(2N + a + b + c 
T{n Al +a)T{n ai +b + c) 



r(n Al + a)r(n ai + c)r(n A2 + b)r(n a2 + c] 



E Bl ,B 2 [h{pi,p 2 ) 



(22) 
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where E Bl)B2 [h(pi,p 2 )] is defined in the Appendix. Pre- 
ceding the expectation, the first term involves the parame- 
ters of the bivariate beta distribution, the second involves 
sample size (N) as well, and the third one depends on 
observed allelic counts. 

The logarithmic and relative distances between distribu- 
tions are calculated as 

2N 2N 

D M = £ £log )p A (DATA), 

nj, =0 n Al =0 
2N 2N 

DlA = - E E log(^ in , 2 )w(DATA), 

n Al =0 n Al =0 



and 



D ai = E E ^"..^(DATA), 

njj =0 n,i 2 =0 



D m = E E <„^(DATA), 

i/l! =0 nj 2 =0 

With 9 and 6*' calculated as before. Calculations proceed 
on an entirely numerical basis. 

A generalization of the beta process to a situation with 
more than two loci is in Olkin & Liu (2003). For K loci 
the joint density of the allelic frequencies takes the form 



g(pi,P2,---,PK\Q) 



r(«i + a 2 



a K ) 



fl r( fli ) 



fi(i-Pi 



-(oi+a2+...+a K ) 



The marginal distribution of allelic counts is obtained 
by mixing a multinomial distribution over the multivariate 
beta density above, but results are not available in closed 
form, so a numerical solution is needed again. 

Lee-Sarmanov bivariate beta distribution 

Lee (1996) proposed a different bivariate beta distribution 
following Sarmanov (1966); see Danaher & Hardie 
(2005). A Lee-Sarmanov bivariate beta distribution for 
allelic frequencies pi and p 2 has density 

9(Pi,P2\0*) = Bi(ai,bi)B 2 {a 2 ,b 2 )[l + w(pi - n 1 )(p 2 - n 2 j\, 

(23) 

Where Bi(.) and B 2 (.) are beta densities as before and w is 
a parameter (w = 0 produces independence); Q* = (a 1 ,b 1 , 



a 2 ,b 2 ,w). Above, fi x 



and n 2 



a 2 +b : 



are the expected 



values of p\ and p 2 respectively. The marginal densities of 
Pi and p 2 under this model are B x (.) and B 2 (.). Further, 

E(pip 2 ) = \i-i\i 2 + wFar(pi)Var(p 2 ), 



Cov(p 1 ,p 2 ) = w^ar(pi)Var(p 2 ) 



so that 



with 



Corr(pi,p 2 ) = Wv/Var(pi)Var(p 2 ), 



fl f + bi + 1 

In the Lee-Sarmanov bivariate beta distribution, a null 
correlation implies independence. Under this model the 
density ratio becomes 

= B l{ai M)B 2 (a 2 ,b 2 ) = [1 + WiPl ~ " l)iP2 ~ 



D A i = E 9 {log[l + w(pi - n 1 )(p 2 - fi 2 )]}, 
where E g denotes expectation under association and 

D M = -E BliB2 {log[l + w(pi - fi 1 )(p 2 - fi 2 )]}, 

is the corresponding expectation under independence of 
the distributions. Further 



D' 



/ / [1 + w( Pl ~ fi^ipi - n 2 )} 2 
■Jo Jo 

Bi(a 1 ,b 1 )B 2 (a 2 ,b 2 )dp 1 dp 2 



= l + Corr 2 (p 1; p 2 ) (24) 
is in closed form, while 

D/a = B Bl B 2 [l + w( Pl - [il)(P2 - H 2 T\ (25) 

can be estimated using Monte Carlo sampling, assuming 
this expectation exists. Hence 

1 + Corr 2 (p u p 2 ) +E BlBl [l + w(pi - n 1 )(p 2 - n^]' 1 

The index 9' was evaluated for three sets of values of 
the a,b parameters for each of the two intervening beta 
distributions while setting the correlation at 0,|,| or jg. 
Metric D' Al in (24) follows directly from the correlation, 
while D' M was estimated using 1 million random samples 
from each of the two beta distributions. For each set of a,b 
parameters w was calibrated via 



Corr(pi,p 2 ) 
V'Var(pi)Var(p2 



= Corr(pi,p 2 ; 



(ai + hi) 2 (fli + b 1 + l)(a 2 +fc 2 ) 2 (fl 2 + (7 2 + l) 



VL\b\VL 2 b 2 



Note that D'a ; is solely a function of the correlation, so it 
was the same for the three pairs of beta distributions exam- 
ined. Using (25), D'i A was estimated by averaging 
a pik = I 1 + w (Pi _ i"l)(P2 - fc)] 1 over draws. Because 
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Vik is a ratio between density functions, negative realiza- 
tions were not used when forming the Monte Carlo esti- 
mate (this produces an upward bias but a more precise 
estimate). As shown in Table 4, the KL distance between 
the bivariate beta distribution and the independence pro- 
cess increased monotonically with the strength of the cor- 
relation. The indexes of association 9' and 1—9' also 
drifted away monotonically from \ only in setting 2. For 
instance, in setting 1 9' increased from 0.50 to 0.66 as the 
correlation increased from 0 to |, but decreased to 0.60 
when the correlation was jg. The shape of the bivariate 
beta distribution was examined for this setting by plotting 
the density at 20 000 random points drawn from each of 
the marginal Be£a(2,2) distributions. This is shown in 
Fig. 5: the Lee-Sarmanov density 'evolves' towards a more 
complex topography as correlation increases. When corre- 
lation grows from 0 to |, the shape of the joint distribution 
is not too different from that under independence, as also 
suggested by KL and 9'. However, when the correlation 
increases by a factor of 3 from \ to |, the KL distance grows 
only about two times, while 9' and 1 — 6' move away from 
i at an even slower rate; this is expected because 9 has an 
upper bound at 1, whereas KL takes any positive value in 
the real line. The bimodal shape of the density also suggests 
that more intensive sampling is needed to obtain reliable 
estimates of the indexes of association. 

As shown in Danaher & Hardie (2005), the marginal 
distribution of the observed allelic counts DATA = 
(n Al ,n ai ,n A2 ,n a2 ) is obtained by mixing (18) over the 
Lee-Sarmanov density (23), leading to 



p A (DATA) = H 
l=i 



( 2N\ r( fl) + In) T{n Al + ai)T{n ai + h) 

\n A J r( fl; )r(b ; ) r(2N + «, + &,) 



1 + w 



(n Al -Wn{)(n Al -2Nn 2 ) 



(«i + h 



(a 2 + h) 



(26) 



Using the fact that under independence the joint distribu- 
tion is product beta-binomial as in (21), it turns out that 



a, 



1 + W 



[n Al - 2NHi) (n Al - 2Nn 2 ) 



(«i + h) (a 2 + b 2 ) 
so that closed forms are available for 



E E 

11^=0^=0 L 



1 + w 



{n Al - 2Nhj) (n Al - 2Nn 2 ) 



(«i + h 



(fl2 + b 2 ) 



n 



2N\ r(a, + b,) T(n Al + fl;)r(n a , + b{] 

n A , J r( fl; )r(b ; ) r(2N + «, + &,) 



and 



E E 

n Ay =0 n &2 =0 



1 + W 



(n Al - 2Nn x ) (n Al - 2N/i 2 ) 



(«i + In 



(«2 + b 2 



Table 4 Departures of Lee-Sarmanov bivariate beta distributions with 
varying correlation from independence, as measured by D' Ah D) A and 
by indexes of association q' and 1 - q' (when 8' = \ independence 
holds). The four correlations levels depend on the value of parameter 
w of the bivariate distribution, and on parameters a,b of the two 
'parental' beta distributions. D' AI has closed form and D', A was esti- 
mated using 1 million samples from the beta distributions. KL'=D 
■ AI + D' AI ; q' = D' IA /(D' AI + D' IA ). 



Setting/Item D' AI 



(% negative) KL' 



(1) a-, =b-,=a 2 
Correlation (w) 

0(0) 

J (5) 
1(15) 

(2) a, = 10; b, - 
Correlation (w) 

0(0) 

K 281 ) 
1(842) 

^(1010) 

(3) a, = .25; b. 
Correlation (w) 

0(0) 

?( 135 ) 
|(404) 

to( 485 ) 



1 

1.0625 
1.5625 
1.8073 

= 90; a 2 = 

1 

1.0625 
1.5625 
1 .8073 
= -75; a 2 

1 

1.0625 
1.5625 
1.8073 



1 

1.2681 (0.0) 
3.0985 (8.5) 
2.7369 (11.2) 
90; b 2 = 10 

1 

1.1466 (0.4) 
2.6824 (6.3) 
4.4937 (8.7) 
.75; b 2 = .25 

1 

1.3766 (0.9) 
1.5094 (4.9) 
1.5876 (5.6) 



2 

2.3307 
4.6610 
4.5468 



2 

2.2091 
4.2449 
6.3037 



2 

2.4391 
3.0719 
3.3976 



0.5000 
0.5441 
0.6648 
0.6019 



0.5000 
0.5190 
0.6319 
0.7128 



0.5000 
0.5644 
0.4914 
0.4673 



0.5000 
0.4559 
0.3352 
0.3981 



0.5000 
0.4810 
0.3681 
0.2871 



0.5000 
0.4356 
0.5086 
0.5327 



n 



2N\ r(a; + bi) T{n Al + «,)!>„, + hi 



n Al J r(a,)r(b,) r(2N + a, + b ( ) 



The correlation between number of copies of the A 
alleles at the two loci is (Danaher & Hardie 2005) 



Corr(n Al ,n Al ) = 2Nw 



Var(p!)Var(p 2 



(fli + (5i+2N)(a 2 + ft 2 + 2N) 



Var(p!)Var(p 2 



K 1 2N )\ X "•" 2N ) 

so that for large N 

Corr(rc 4l ,n A2 ) wV , Var(pi)Var(p 2 ) = Corr(p!,p 2 )- 

A multivariate generalization of (26) is direct but 
appears that it has not been reported elsewhere. The joint 
density of the allelic frequencies at L loci takes the form 



g(pi,P2,---,PL\6*) 

L 



L-l L 

i + E E w v( pi ~ <"i)( p2 ~ ^2) 

i=l ;=i+l 



(27) 



It is straightforward to verify that this density integrates 
to 1, and that all marginal distributions are beta. Now, let 
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BB,K|a ( ,&,,2N) = jfY^ i i^(l-B)^fl i fo ) to)<to 

2N\ r(g) + bi) T(n A , + fl))r(n„, + b,) 
n Al ) T(a,)r{b,) T(2N + «( + &,) ' 

denote the beta-binomial distribution of the observed 
number of copies of allele At, out of 2N copies. The mar- 
ginal distribution of allelic counts over all loci can be 
shown to be, after algebra 



Pr(n Al , n Al , . . .,n Al \q*) = BBi(n A ,\ai, k, 2N) 



1-1 



(n Al - 2N^)(n A . - 2Afy) 



i=i j=i+i 



(fl i + ft i + 2N)(n, + /;, + 2N) 



(28) 



We propose (for obvious reasons) to name (27) and 
(28) as multivariate beta GMS-Sarmanov and multivariate 
beta-binomial GMS-Sarmanov distributions, respectively. 

Discussion 

The problem addressed in this study was that of measur- 
ing departures of the joint distribution of genetic variables 
from independence. New measures of association based on 



notions of statistical distance were proposed and evaluated 
under several scenarios, spanning from multivariate 
Gaussian distributions modeling, say, pleiotropic effects, to 
systems of beta distributions describing association 
between allelic frequencies. Two hereto seemingly unre- 
ported probability distributions were derived, and termed 
GMS-Sarmanov multivariate beta and GMS-Sarmanov 
multivariate beta-binomial distributions. The standard LD 
problem was also dealt with to illustrate the generality of 
the approaches proposed. 

Linkage disequilibrium analysis has been the subject of 
an enormous amount of research, re-energized with the 
availability of massive molecular marker and sequence 
data. For example, in the context of coalescent theory, 
an important issue is the decoupling of ancestries of sites 
at different regions of the chromosome, and this is done 
by studying association between alleles at different loci 
(Wakeley 2009). Most of the standard measures of LD 
employed are pairwise statistics such as correlations (e.g. 
Hill & Robertson 1968; Hill 1974; Hedrick 1987; Lewon- 
tin 1988; Morton etal. 2001; Pritchard & Przeworski 
2001; McVean 2007), because of ease of calculation and, 
perhaps, ease of interpretation. However, as pointed out 
by Lewontin (1988), although a correlation is a mean- 
ingful parameter (e.g., in terms of the amount of variabil- 
ity of Y explained when X is observed) in a bivariate 
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normal distribution, it is arguably less so when applied to 
discrete data, a problem that is well known in quantita- 
tive genetic analysis of discrete phenotypes (Dempster & 
Lerner 1950; Gianola 1982). Further, pairwise measures 
do not characterize association well if a given genetic sys- 
tem involves many correlated random variables, as in 
multi-locus measures of LD. Actually, such measures 
have been reported much less often, e.g. Weir (1996) 
gives formulae for up to four loci, and Sabatti and Risch 
(2002) give a measure based on excess of heterozygosity 
or of homozygosity. 

Nothnagel et al. (2002) presented an entropy-based 
index of LD that is related to our approach, but that differs 
in some respects. These authors calculate the entropy of 
the allelic distributions under linkage equilibrium and 
under disequilibrium, and express the difference in entropy 
as a fraction of the equilibrium entropy. This produces a 
normalized entropy that takes values between 0 and 1 (0 
indicating no association between alleles at different loci). 
A formal objection is that, while entropy measures uncer- 
tainty in a distribution, its relationship to statistical dis- 
tance between distributions (Kullback 1968) requires 
more elaboration. Their method is for bi-allelic loci only, 
and entropy does not always behave well in continuous 
distributions (Bernardo and Smith, 1994; Sorensen & 
Gianola 2002), whereas relative entropy measures such 
as the KL distance are well defined. 

Closer to the spirit of our procedures, Liu & Lin (2005) 
suggested a measure of LD based on a relative KL discrep- 
ancy, but it differs from the ones we propose. This is 
mainly because they use only one of the two components 
(termed D AI in our paper) of the invariant KL distance, 
and express it relative to the maximum value it can take. 
Using their procedures with D lA would produce a different 
value of association, and maybe a different qualitative pic- 
ture might emerge from analysis of genetic data. However, 
their ideas can be embedded in our approach, and express- 
ing association relative to a maximum distance is a well 
taken point. It also turns out that these results can also 
be adapted to the continuous domain, with some care. To 
illustrate this, it suffices to consider two random vectors, x 
and y, as generalization to a higher-dimensional system is 
straightforward. Let H(x), H(y), and H(x,y) be the entro- 
pies (non-negative, by construction, although pathological 
examples may arise in continuous distributions) of the 
marginal distributions of x, y, and x,y respectively, with 

H(x) = - J[logp(x)]p(x)dx, 
H(y) = - /[logp(y)]p(y)dy, 

and 

H(x,y) = - J [\ogp(x,y)]p(x,y)dxdy. 



The discrepancy away from the association model is 



log 



p(x,y) 



p(x)p(y) 



p(x,y)dxdy =H(x) + H(y) - H(x,y). 

(29) 



Note that 



H(x,y) = - J[\ogp(x) + logp(y|x)]p(x)p(y|x)dxdy 

= H(x)+H(y\x), 

Where H(y | x) is the entropy of the conditional distribu- 
tion of y given x. This implies that H(x, y) =: H(x), where 
x can also be any partition of x, e.g. its ith coordinate x t , 
say, and similarly H(x, y) 2; H(y). Thus, H(x, y) Si maxj H 
{Xi). Using this in (29) 

D AI =H(x) + H(y) - H(x, y) < D A[max 

= H(x) + H(y) - max[H(xi),H(tf,)]. 
j 

More generally, for a vector z with n elements, the KL 
discrepancy away from the independence model is 



Dm = ^ H ( z d ~ H(zi,z 2 , . . .,z„) <D Mr] 

i=l 
n 

= ^H(z i )-max[H(z i )]. 



(30) 



Likewise 



D 1A = J lo; 



p(x)p(y) 



i?(x,y) 
[H(x)+H(y) 



where 

C{p(x)p{y),p{x,y)) 



p(x)p(y)dxdy =C(x,y) 



[logp(x,y)]p(x)p(y)dxdy. 



is the cross-entropy between distributions. From the 
expressions for D AI above, H(x) + H(y) ^ max t [H(x t ), H(yd], 
so that 

D M = C(x,y)-[H(x) + H(y)]<D M 

max 

= C(x,y)-max[H(x,),H(tf 1 )]. 
i 

This leads to additional indexes of association, such as 
Dm 



D; 



IA max Dai max 



and 



Dia + Dai 



KL 



max + D M max ^raax 



(31) 



(32) 



both taking values between 0 and 1. These indexes were 
evaluated for a bivariate normal distribution with correla- 
tion p = jjj and marginal distributions x: Ni(0,l) and y: 
N 2 (0,1). One obtains 
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H(x) =H(y) =-(l+log2jt) « 1.4189, 
D IA = p 2 (1 - p 2 )~ 1 + log Vl-P 2 « 0.3394, 



D 4i = - log Vl-P 2 ~ 0.2231, 

KL = 0.3394 + 0.2231 = 0.5625, 

0 3394 
* = 0^625 -°- 6 ° 33 ' 

DAimax = 2 x 1.4189 - 1.4189 = 1.4189, 
C(x,y) = -log ? L= + r ^= 3.1772, 
Dm max = 3.1772 - 1.4189 = 1.7583, 



0.3394 



1.4189 + 1.7583 



and 



0.5625 



1.4189 + 1.7583 



0.1068, 



0.1770. 



The four indexes, p, 6, 9** and q** produce a different 
value of the strength of association between random vari- 
ables in a distribution. This suggests that probably there is 
no such thing as a universal measure, although it is clear 
that the coefficient of correlation lacks generality. 

In a nutshell, this paper presents measures of associa- 
tion for systems of genetic variables that go beyond stan- 
dard two-dimensional statistics. The procedures apply to 
either continuous or discrete data, and typically require 
numerical implementation, because many of the expres- 
sions are not available in closed form, depending on the 
distribution assumed. Our procedures (like any other 
method) require knowledge of the parameters of the distri- 
butions under independence or association, and estimation 
of such parameters is not the objective of this paper. 
Today, computer-intensive approaches for parameter infer- 
ence, such as Bayesian Markov chain Monte Carlo (Soren- 
sen & Gianola 2002; Gelman etal. 2004) or approximate 
Bayesian computation (Beaumont et al. 2002), can be 
implemented effectively in today's high throughput sys- 
tems (Wu etal 2011). 

Finally, since this volume is in honor of the contribu- 
tions of Professor Moshe Soller, a relationship between this 
paper and his work should be established. As noted at the 
onset of this document, among many other accomplish- 
ments, he pioneered marker-assisted selection in animal 
and plants via exploitation of linkage and LD relationships 
between markers and unknown quantitative trait loci. 
Examples of his papers in this area include Soller & Genizi 
(1978) and Lipkin etal. (2009), and these connect with 



some of the developments presented here. We look for- 
ward to many more scientific accomplishments by Moshe! 
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Appendix 



Bivariate and univariate t-distributions 

The density of a K-variate random vector following a mul- 
tivariate t-distribution is 



K*|A«,£,v) 



r(t)(* 



i + 



(33) 



where fi =E(x) , £ is the scale matrix, -u > 0 is the degrees 
of freedom parameter and T(.) is the Gamma function; 
the covariance matrix of this distribution is Eu/(u — 2). 
The specification k=2, u = 0 and S = I 2 , yields a bivari- 
ate t-distribution with null mean and covariance matrix 
Var(x) = l2v/(u — 2); the two marginal distributions are 
univariate-t, each having a null mean, variance v/(v — 2) 
and v degrees of freedom. Although uncorrelated, these 
two random variables are statistically dependent because 
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The draws from the bivariate t-distribution needed for 
evaluating D' A1 and D' lA are obtained as 



X\ 




Zl 




x 2 _ 




_2 2 _ 





Under independence, the sampling from 2 univariate t- 
distributions is done as 



Above, the zs are independent draws from N(0, 1), Xv 
is a draw from a central chi-square distribution on v 
degrees of freedom, and y^ x and y} vl are two independent 
draws from the same distribution. 



I(n Al , n Al \a, b, c) = Bj (n 4l + a, n fll + b + c) 

B 2 [(n A2 +b)p 1 ,(n a2 +a + c)(l-p 1 )]. 

(34) 

Distribution B 2 depends on p x , and has expectation 

(n Al + b) 

n Al + b + {n a2 + a + c) ' 

which would be equal to the mean value under indepen- 
dence only if p! = 0.5. The double integral in (20) can be 
represented as 



// 

o o 



Pi 



(i-pmr^B^ 



B 1 B' 2 dp 1 dp 2 

= r(n Al +fl)r(n fll + b + c) 
T(2N + a+b + c) 



E BlB2 [h(p 1 ,p 2 )}, 



where 



h( x r[( nA2 +b)p 1 ]r[( fla2 + fl + c )(i- Pl )]p("^ +6 ) (l ''' l) (i-p 2 )( 

lPl,P2j r[(^ 2 +I»)pi+(n«2+fl+c)(l-i>i)] (I-P1P2 



"a2+ a + c jPl 
a+b+c 



Olkin-Liu bivariate beta-binomial 
distribution: sampling scheme 

When allelic frequencies are independent, their joint den- 
sity is the product of the densities of the beta random vari- 
ables Bi (n Al + a, n ai + b + c) and B2 (n Al +b,n a2 + a + c) . 
However, this does not provide a suitable importance sam- 
pling distribution, because dependency would not be rec- 
ognized at all, with the sample space not visited 
appropriately. A form of introducing dependency is via an 
importance sampling density with two dependent beta dis- 
tributions, such as 



and Eb 1 b 2 denotes expectation with respect to the joint 
distribution with density B\B' 2 . Using this in (20), the 
marginal distribution of the data under the association 
model is 
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Expression (35) is a seemingly unreported discrete prob- 
ability distribution which we term 'Olkin-Liu bivariate 
beta-binomial distribution'. 
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